R introduction

Why R?

  • Powerful, open-source tool for data analysis and visualization
  • Widely used in research; supports reproducible, automated analyses

Course Goals:

  • Build a solid foundation in R programming
  • Bridge the gap between raw data and meaningful biological insights

Rich Ecosystem of Packages

  • CRAN: Comprehensive R Archive Network
  • Bioconductor: Tools for bioinformatics
  • GitHub: User-contributed packages

Example packages:

  • dplyr for data manipulation
  • Biostrings for sequence analysis
  • GenomicRanges for genomic intervals
  • DESeq2 for differential expression analysis
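
As a rough sketch of how the repositories above map to install commands: CRAN packages are installed with install.packages(), and Bioconductor packages with BiocManager::install() (BiocManager itself comes from CRAN). The snippet only checks what is already installed and prints the commands for anything missing; it does not download anything. The variable names (pkgs, installed, cmds) are illustrative.

```r
# Where each example package comes from
pkgs <- c(dplyr = "CRAN", Biostrings = "Bioconductor",
          GenomicRanges = "Bioconductor", DESeq2 = "Bioconductor")

# TRUE for packages that are already installed (nothing is downloaded here)
installed <- vapply(names(pkgs), requireNamespace, logical(1), quietly = TRUE)

# Build the install command for each package.
# BiocManager must be installed first: install.packages("BiocManager")
cmds <- ifelse(pkgs == "CRAN",
               sprintf('install.packages("%s")', names(pkgs)),
               sprintf('BiocManager::install("%s")', names(pkgs)))

# Show the commands for packages that are still missing
print(unname(cmds[!installed]))
```

Packages hosted only on GitHub are commonly installed with remotes::install_github("user/repo"), where "user/repo" is a placeholder for the repository name.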

Reporting in R

  • Combine text, code, and output in a single document
  • R Markdown can be rendered to HTML, PDF, Word, and more
  • R Markdown is the basis for this presentation!
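
A minimal sketch of the workflow described above, assuming the rmarkdown package and pandoc are available: it writes a tiny R Markdown file to a temporary folder and renders it to HTML. The file name report.Rmd and its contents are made up for illustration; rendering is skipped if rmarkdown or pandoc is missing.

```r
# Three backticks, built programmatically so they can be written into the file
fence <- strrep("`", 3)

# Write a minimal R Markdown document: header, text, and one code chunk
rmd_path <- file.path(tempdir(), "report.Rmd")
writeLines(c(
  "---",
  'title: "Example report"',
  "output: html_document",
  "---",
  "",
  "Some text describing the analysis.",
  "",
  paste0(fence, "{r}"),
  "summary(iris$Sepal.Length)",
  fence
), rmd_path)

# Render to HTML only if the required tools are installed
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd_path, output_format = "html_document",
                           quiet = TRUE)
  print(out)  # path of the rendered HTML file
}
```

Changing output_format to "pdf_document" or "word_document" would target the other formats mentioned above; PDF output additionally needs a LaTeX installation.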

Documentation

  • Comprehensive documentation for all functions
  • Vignettes provide detailed examples
  • Use ?function_name to access help
  • Use example(function_name) to see examples
  • Use browseVignettes("package_name") to list a package's vignettes and vignette("topic") to open one
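
The help tools listed above can also be used programmatically, which is handy in scripts. The sketch below uses mean and the grid package purely as examples; in an interactive session you would normally just type ?mean or example(mean).

```r
# help("mean") is what ?mean looks up; printing h displays the help page
h <- help("mean")

# Fetch the example code from the help page without running it
ex <- example("mean", give.lines = TRUE)
head(ex)

# List the vignettes that ship with a package (grid is part of base R)
vigs <- vignette(package = "grid")
vigs$results[, c("Item", "Title")]
```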

Biostrings

library(Biostrings)
seq1 <- DNAString("ACGTACTTGCGTCGTCGTACG")
seq2 <- DNAString("ACGTAGGTCGTCGTCGTACG")
# create a global pairwise alignment
# (since Bioconductor 3.19, pairwiseAlignment() is provided by the pwalign package)
alignment <- pairwiseAlignment(seq1, seq2)
print(alignment)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: ACGTACTTGCGTCGTCGTACG
subject: ACGTAGGT-CGTCGTCGTACG
score: 9.873037 
# search for a pattern in a sequence
matchPattern("GTC", seq1)
Views on a 21-letter DNAString subject
subject: ACGTACTTGCGTCGTCGTACG
views:
      start end width
  [1]    11  13     3 [GTC]
  [2]    14  16     3 [GTC]

GenomicRanges

library(GenomicRanges)
gr <- GRanges(
    seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
    ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
    strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
    score = 1:10,
    GC = seq(1, 0, length.out = 10))
gr
GRanges object with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-111      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
  d     chr2   104-114      * |         4  0.666667
  e     chr1   105-115      * |         5  0.555556
  f     chr1   106-116      + |         6  0.444444
  g     chr3   107-117      + |         7  0.333333
  h     chr3   108-118      + |         8  0.222222
  i     chr3   109-119      - |         9  0.111111
  j     chr3   110-120      - |        10  0.000000
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

Visualizations in R

library(ggplot2)
data(iris)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3)

Visualizations in R

library(ggplot2)
library(ggridges)
ggplot(diamonds, aes(x = price, y = cut, fill = cut)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none")

Visualizations in R

# heatmap
data(mtcars)
dat <- as.matrix(mtcars)
heatmap(dat, scale = "column", col = cm.colors(256))

Visualizations in R

  • Genomic data
  • Interactive graphics

Basic R Syntax

  • Comments: #
  • Assignment: <- or = (<- is preferred)
# this is a comment
x <- 5

Basic R Syntax

  • Data types: numeric, character, logical
x <- 5
y <- "hello"
z <- TRUE
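
To check which type a value has, class() can be used. A short sketch building on the variables above (n is an extra variable added here to show the integer type):

```r
x <- 5
y <- "hello"
z <- TRUE

class(x)  # "numeric"
class(y)  # "character"
class(z)  # "logical"

# The L suffix creates an integer instead of a double
n <- 5L
class(n)  # "integer"

# Values can be converted between types
as.numeric("3.14")  # 3.14
as.character(x)     # "5"
```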

Basic R Syntax

  • Vectors: c()
  • Matrices: matrix()
x <- c(1, 2, 3)
print(x)
[1] 1 2 3
y <- matrix(1:9, nrow = 3)
print(y)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
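
Vectors and matrices are indexed with square brackets, starting at 1, and arithmetic applies element by element. A brief sketch using the same x and y as above:

```r
x <- c(1, 2, 3)
x * 2       # 2 4 6 (each element is multiplied)
x[2]        # 2 (second element)
x[x > 1]    # 2 3 (elements that satisfy a condition)

y <- matrix(1:9, nrow = 3)  # filled column by column
y[2, 3]     # 8 (row 2, column 3)
y[, 1]      # 1 2 3 (entire first column)
dim(y)      # 3 3 (rows, columns)
```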

Basic R Syntax

  • Data frames: data.frame()
print(head(iris))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
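
The iris table above is a built-in data frame. As a small sketch, a data frame can also be built by hand with data.frame(); the gene names and values below are made up for illustration.

```r
# Each argument becomes a column; all columns must have the same length
df <- data.frame(
  gene        = c("geneA", "geneB", "geneC"),
  expression  = c(12.5, 3.2, 8.1),
  significant = c(TRUE, FALSE, TRUE)
)

str(df)                # compact overview of columns and types
df$expression          # access one column by name
df[df$significant, ]   # keep only rows where significant is TRUE
nrow(df)               # 3
ncol(df)               # 3
```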

Basic R Syntax

  • Lists: list()
x <- list(1, "hello", TRUE)
print(x)
[[1]]
[1] 1

[[2]]
[1] "hello"

[[3]]
[1] TRUE
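
List elements are usually accessed with [[ ]] or, for named lists, with $. A short sketch; sample_info and its fields are hypothetical.

```r
x <- list(1, "hello", TRUE)
x[[2]]        # "hello" (the element itself)
class(x[2])   # "list" (single brackets return a shorter list)

# Named lists make the contents easier to read
sample_info <- list(id = "S1", reads = 1500, passed = TRUE)
sample_info$reads    # 1500
names(sample_info)   # "id" "reads" "passed"
```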

Basic R Syntax

  • Functions: define with function(), call with function_name()
  • Arguments: function_name(arg1 = value1, arg2 = value2)
x <- mean(c(1, 2, 3))
print(x)
[1] 2
y <- t.test(x = c(1, 2, 3), y = c(4, 5, 6))
print(y)

    Welch Two Sample t-test

data:  c(1, 2, 3) and c(4, 5, 6)
t = -3.6742, df = 4, p-value = 0.02131
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.2669579 -0.7330421
sample estimates:
mean of x mean of y 
        2         5 
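
Besides calling existing functions, you can define your own with function(). The sketch below defines a hypothetical helper, gc_content(), with one default argument, and also shows that the result of t.test() is a list whose parts can be extracted by name.

```r
# Fraction (or percentage) of G and C bases in a DNA string
gc_content <- function(sequence, as_percent = FALSE) {
  bases <- strsplit(toupper(sequence), "")[[1]]   # split into single letters
  frac <- mean(bases %in% c("G", "C"))            # share of G/C letters
  if (as_percent) frac * 100 else frac
}

gc_content("ACGT")                      # 0.5
gc_content("GGCC", as_percent = TRUE)   # 100

# Test results are lists: individual values can be pulled out
res <- t.test(x = c(1, 2, 3), y = c(4, 5, 6))
res$p.value
names(res)
```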

Basic R Syntax

  • Control structures: if, else, for, while
  • Packages: library(), install.packages()
  • Help: ?, example(), vignette()
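
A minimal sketch of the control structures listed above; the variable names (label, total, n) are illustrative.

```r
x <- 5

# if / else: run code depending on a condition
if (x > 3) {
  label <- "high"
} else {
  label <- "low"
}
label   # "high"

# for: repeat once per element of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15

# while: repeat as long as a condition holds
n <- 1
while (n < 100) {
  n <- n * 2
}
n       # 128
```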